f(x1, x2) = (x1 + x2)^2
Assume that x1, x2 ~ U[-1,1] and x1=x2 (full dependency)
Calculate the PD profile for x1:
$$E_{x_2}[x_2] = 0$$

$$E_{x_2}[x_2^2] = \frac{1}{3}$$

$$g_{PD}(z) = E_{x_2}[(z+x_2)^2] = E_{x_2}[z^2 + 2zx_2 + x_2^2] = z^2 + 2zE_{x_2}[x_2] + E_{x_2}[x_2^2] = z^2 + \frac{1}{3}$$

In this homework we used the brain stroke dataset from Kaggle.
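As a quick numerical sanity check of the PD derivation above, the value $g_{PD}(z) = z^2 + \frac{1}{3}$ can be verified by Monte Carlo sampling of $x_2 \sim U[-1,1]$ (a standalone sketch, not part of the original notebook):

```python
import numpy as np

rng = np.random.default_rng(0)
x2 = rng.uniform(-1.0, 1.0, size=1_000_000)  # x2 ~ U[-1, 1]

def g_pd(z):
    """Partial-dependence value at z: average of f(z, x2) over the marginal of x2."""
    return np.mean((z + x2) ** 2)

for z in [-1.0, 0.0, 0.5, 1.0]:
    print(z, g_pd(z), z**2 + 1/3)  # empirical PD vs analytic z^2 + 1/3
```

Note that because PD averages over the *marginal* of $x_2$, it ignores the dependency $x_1 = x_2$ entirely; that is exactly the point of this exercise.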
I prepared the data (one-hot encoding) and first trained an XGBClassifier model. The model achieved 94.6% accuracy.
I selected a few observations; all of their predictions were 0. Then I calculated what-if (ceteris paribus) explanations. The results are as follows:
On these few examples we can clearly see how increasing age increases the predicted probability of brain stroke.
Comparing indexes 8 and 15 on the CP profiles of avg_glucose_level, we can observe that the prediction increases in different value ranges: in one sample between 160 and 190, and in the other between 220 and 250.
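The CP profiles above come from dalex's `predict_profile`; conceptually, a ceteris-paribus profile just sweeps one feature over a grid while holding all other features of the observation fixed. A minimal hand-rolled sketch of that idea (using a toy LogisticRegression and synthetic data, not the notebook's XGBClassifier):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy data: the label depends almost entirely on x1.
rng = np.random.default_rng(42)
X = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
y = (X["x1"] + 0.1 * rng.normal(size=200) > 0).astype(int)
model = LogisticRegression().fit(X, y)

def ceteris_paribus(model, observation, feature, grid):
    """Predict on copies of one observation with `feature` swept over `grid`."""
    rows = pd.concat([observation] * len(grid), ignore_index=True)
    rows[feature] = grid
    return model.predict_proba(rows)[:, 1]

obs = X.iloc[[0]]                      # a single observation (one-row DataFrame)
grid = np.linspace(-3, 3, 7)
profile = ceteris_paribus(model, obs, "x1", grid)
# For this toy model the profile rises monotonically with x1.
```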
Comparing CP with PDP. Results below:
In both the age and bmi PDP plots we observe that the blue PDP line looks like the same function as the shadowed CP profiles, only shifted. We also observe that bmi has very little impact on the prediction.
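That resemblance is no accident: the partial dependence profile is the pointwise average of the ceteris-paribus profiles over the dataset. A minimal sketch of that relationship (toy model and data, assumed for illustration, not the notebook's):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = pd.DataFrame({"x1": rng.normal(size=300), "x2": rng.normal(size=300)})
y = (X["x1"] + 0.5 * X["x2"] > 0).astype(int)
model = LogisticRegression().fit(X, y)

grid = np.linspace(-2, 2, 9)

def cp_profile(row, feature="x1"):
    """CP profile of one observation: sweep `feature` over the grid."""
    rows = pd.concat([row.to_frame().T] * len(grid), ignore_index=True)
    rows[feature] = grid
    return model.predict_proba(rows)[:, 1]

# PDP is the mean of the individual CP profiles -- the "shifted" blue line.
pdp = np.mean([cp_profile(X.iloc[i]) for i in range(len(X))], axis=0)
```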
In line with intuition, the PDP curves for LogisticRegression look like smoothed versions of the PDP curves for XGBClassifier.
PDP for XGBClassifier:
PDP for LogisticRegression:
CP and PDP provide good insight into how attributes contribute to the final prediction. They seem a good complement to the LIME and SHAP methods: first we can identify which attributes are most important, and then we can visualize how changes in their values impact the prediction.
!pip install dalex shap catboost lime
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
import pandas as pd
import lime
import dalex as dx
import shap
df = pd.read_csv("brain_stroke.csv")
df
The dataset contains non-numerical features. Let's one-hot encode them first.
df_one_hot = pd.get_dummies(df, drop_first=True)
df_one_hot
Preparing train and test samples.
X = df_one_hot.drop("stroke", axis=1)
y = df_one_hot.stroke
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train
xgb = XGBClassifier()
xgb.fit(X_train, y_train)
xgb_predict = xgb.predict(X_test)
xgb_acc = accuracy_score(y_test, xgb_predict)  # (y_true, y_pred)
model = xgb
xgb_acc
The model achieved 94.6% accuracy.
observations = [4, 8, 15, 16, 34, 42]
predictions = model.predict(X_train.iloc[observations])
predictions
explainer = dx.Explainer(model, X_test, y_test)
explainer.model_performance(cutoff=y_train.mean())
observations_data = X_train.iloc[observations]
explainer.predict(observations_data)
explainer.model_parts().result
cp = explainer.predict_profile(new_observation=X.iloc[observations])
cp.plot(variables=["age", "bmi"])
cp.plot(variables=["avg_glucose_level"])
cp_8_15 = explainer.predict_profile(new_observation=X.iloc[[8,15]])
cp_8_15.plot(variables=["avg_glucose_level"])
pdp = explainer.model_profile()
pdp.result
pdp.plot(variables=["age", "bmi"])
pdp.plot(variables=["age", "bmi"], geom="profiles", title="Partial Dependence Plot with individual profiles")